I chose the Iris data set first, and R as the programming language, specifically the H2O package. It is a bit of an 'overkill' for this task, but I found it useful to try out: at work we use a distributed system (Hadoop), and H2O can also process larger data sets stored on HDFS, so I combined the useful with the useful. It has a drawback too: only a few models are implemented. SVM, for example, is missing, because it is harder to parallelize on a distributed system.

The iris data set is small, but I think the classifiers' accuracy can still be measured quite well: one class is linearly separable from the other two, while those two cannot be separated from each other (see the plot below with 3 of the variables).

library(plotly)
plot_ly(iris,x=~Petal.Length,y=~Sepal.Length, z=~Petal.Width, color = ~Species, type="scatter3d",marker = list(opacity=0.5))

The data

There are only 4 real-valued variables, the measurements of the flowers' sepals and petals, plus the species of the flower. The data set needs no preprocessing. The species could be converted with "one hot encoding" if we wanted to predict one of the measurements from the other parameters. The data set ships with the R environment, so there was nothing to download.
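As an aside, the one-hot encoding mentioned above can be done in base R with `model.matrix()`; a minimal sketch (the `Species*` column names are whatever `model.matrix` derives from the factor levels):

```r
# One-hot encode the Species factor; "- 1" drops the intercept so all
# three levels get their own 0/1 indicator column.
onehot <- model.matrix(~ Species - 1, data = iris)
colnames(onehot)  # Speciessetosa, Speciesversicolor, Speciesvirginica

# Every row contains exactly one 1 among the three indicator columns,
# so the encoding can be joined back to the measurements for regression:
iris_reg <- cbind(iris[, 1:4], onehot)
```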

library(h2o)
h2o.init() # connect to / start the server
irisdf <- as.h2o(iris) # load the data into the server
summary(irisdf)
isplit<-h2o.splitFrame(irisdf,ratios = 4/5, destination_frames = c("train","test"),seed = 1) # train/test split
itrain <- isplit[[1]]
itest <- isplit[[2]]
summary(itrain)
summary(itest)

The split is not perfectly proportional because h2o.splitFrame is designed for large data sets, and that is where it works well.
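For a data set this small, an exact 80/20 split is easy to do in base R for comparison (a sketch; this is not what `h2o.splitFrame` does internally, which samples each row independently, hence the approximate ratios):

```r
set.seed(1)
# Draw exactly 80% of the row indices, without replacement
idx <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
train_exact <- iris[idx, ]   # exactly 120 rows
test_exact  <- iris[-idx, ]  # exactly 30 rows
```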

The classifiers

The 3 classifiers I chose: Naive Bayes, a neural network and a random forest.

Bayes

bayes <- h2o.naiveBayes(x=1:4,y=5,itrain)
bayes@model$training_metrics
## H2OMultinomialMetrics: naivebayes
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.03200071
## RMSE: (Extract with `h2o.rmse`) 0.1788874
## Logloss: (Extract with `h2o.logloss`) 0.1098575
## Mean Per-Class Error: 0.04126016
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         37         3 0.0750 =  3 / 40
## virginica       0          2        39 0.0488 =  2 / 41
## Totals         45         39        42 0.0397 = 5 / 126
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.960317
## 2 2  1.000000
## 3 3  1.000000
bayes@model$apriori
## A Priori Response Probabilities: 
##     setosa versicolor virginica
## 1 0.357143   0.317460  0.325397
bayes@model$pcond
## [[1]]
## Sepal.Length: 
##   y_by_sepallength     mean  std_dev
## 1           setosa 5.006667 0.363318
## 2       versicolor 5.902500 0.546545
## 3        virginica 6.595122 0.621672
## 
## [[2]]
## Sepal.Width: 
##   y_by_sepalwidth     mean  std_dev
## 1          setosa 3.437778 0.394444
## 2      versicolor 2.760000 0.324867
## 3       virginica 2.975610 0.287037
## 
## [[3]]
## Petal.Length: 
##   y_by_petallength     mean  std_dev
## 1           setosa 1.460000 0.177610
## 2       versicolor 4.220000 0.483682
## 3        virginica 5.558537 0.566558
## 
## [[4]]
## Petal.Width: 
##   y_by_petalwidth     mean  std_dev
## 1          setosa 0.255556 0.105649
## 2      versicolor 1.322500 0.213022
## 3       virginica 2.036585 0.277269
perfbayes <- h2o.performance(bayes,itest)
h2o.confusionMatrix(perfbayes)
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error     Rate
## setosa          5          0         0 0.0000 =  0 / 5
## versicolor      0         10         0 0.0000 = 0 / 10
## virginica       0          1         8 0.1111 =  1 / 9
## Totals          5         11         8 0.0417 = 1 / 24
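To see what the model actually does with the tables above: Naive Bayes scores each class as the prior times the product of the four Gaussian densities, then picks the maximum. A hand-rolled sketch in base R; the means/sds are rounded copies of the `pcond` tables above, and the flower is a made-up measurement:

```r
# Rounded class-conditional parameters from bayes@model$pcond
mu  <- rbind(setosa     = c(5.007, 3.438, 1.460, 0.256),
             versicolor = c(5.903, 2.760, 4.220, 1.323),
             virginica  = c(6.595, 2.976, 5.559, 2.037))
sdv <- rbind(setosa     = c(0.363, 0.394, 0.178, 0.106),
             versicolor = c(0.547, 0.325, 0.484, 0.213),
             virginica  = c(0.622, 0.287, 0.567, 0.277))
prior <- c(setosa = 0.357, versicolor = 0.317, virginica = 0.325)

flower <- c(6.0, 3.0, 4.5, 1.5)  # hypothetical (SL, SW, PL, PW)

# Work in log space: log prior + sum of Gaussian log-densities per class
score <- sapply(rownames(mu), function(cl)
  log(prior[cl]) + sum(dnorm(flower, mean = mu[cl, ], sd = sdv[cl, ], log = TRUE)))
names(which.max(score))  # -> "versicolor"
```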

Neural network

nn <- h2o.deeplearning(x=1:4,y=5,itrain,hidden = c(10),epochs = 1000,diagnostics=TRUE,variable_importances = TRUE,export_weights_and_biases=TRUE) # one hidden layer of 10 units
nn@model$training_metrics
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.01527013
## RMSE: (Extract with `h2o.rmse`) 0.1235724
## Logloss: (Extract with `h2o.logloss`) 0.04608933
## Mean Per-Class Error: 0.01646341
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         39         1 0.0250 =  1 / 40
## virginica       0          1        40 0.0244 =  1 / 41
## Totals         45         40        41 0.0159 = 2 / 126
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.984127
## 2 2  1.000000
## 3 3  1.000000
nn@model$model_summary
## Status of Neuron Layers: predicting Species, 3-class classification, multinomial distribution, CrossEntropy loss, 83 weights/biases, 4.0 KB, 126,000 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1     4     Input  0.00 %                                     
## 2     2    10 Rectifier  0.00 % 0.000000 0.000000  0.177420 0.367138
## 3     3     3   Softmax         0.000000 0.000000  0.466268 0.485801
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.125403   0.744704  0.525238 0.530411
## 3 0.000000   -0.673042   2.073248 -0.318591 0.525782
h2o.varimp_plot(nn)

h2o.weights(nn,matrix_id = 1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    0.1368192   0.2493647   -0.7523624  -0.3132385
## 2    0.5641928   0.6266022    0.3723520   0.1660877
## 3   -0.4600934   0.6264148   -1.6257826  -0.7909308
## 4   -0.4742610  -0.3957320    0.7085864  -0.3115121
## 5    0.4626494  -0.1129947   -0.5199866  -0.9834628
## 6   -0.2412732   0.3762394   -1.6812392  -2.4659739
## 
## [10 rows x 4 columns]
h2o.biases(nn,vector_id = 1)
##           C1
## 1 1.04264398
## 2 1.14371285
## 3 0.01169682
## 4 0.12603429
## 5 1.25553962
## 6 0.19720104
## 
## [10 rows x 1 column]
perfnn <- h2o.performance(nn,itest)
h2o.confusionMatrix(perfnn)
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error     Rate
## setosa          5          0         0 0.0000 =  0 / 5
## versicolor      0         10         0 0.0000 = 0 / 10
## virginica       0          0         9 0.0000 =  0 / 9
## Totals          5         10         9 0.0000 = 0 / 24

Random forest

rf <- h2o.randomForest(x=1:4,y=5,itrain, ntrees = 40) # RF of 40 trees
rf@model$training_metrics
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.03224648
## RMSE: (Extract with `h2o.rmse`) 0.179573
## Logloss: (Extract with `h2o.logloss`) 0.1214671
## Mean Per-Class Error: 0.04939024
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         37         3 0.0750 =  3 / 40
## virginica       0          3        38 0.0732 =  3 / 41
## Totals         45         40        41 0.0476 = 6 / 126
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.952381
## 2 2  1.000000
## 3 3  1.000000
rf@model$model_summary
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              40                      120               15697         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         8    3.36667          2         13     5.45000
h2o.varimp_plot(rf)

perfrf <- h2o.performance(rf,itest)
h2o.confusionMatrix(perfrf)
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error     Rate
## setosa          5          0         0 0.0000 =  0 / 5
## versicolor      0         10         0 0.0000 = 0 / 10
## virginica       0          1         8 0.1111 =  1 / 9
## Totals          5         11         8 0.0417 = 1 / 24
Generated data

To push the comparison further, I generate a larger synthetic sample from the class-conditional means and standard deviations that Naive Bayes learned (bayes@model$pcond) and score all three models on it.

gendata <- function(n,params){
  sl1 <- params[[1]][1,] # s-sepal, p-petal, l-length, w-width, 1-setosa
  sw1 <- params[[2]][1,]
  pl1 <- params[[3]][1,]
  pw1 <- params[[4]][1,]
  dfsetosa<-cbind(rnorm(n=n,mean = sl1["mean"]$mean,sd = sl1["std_dev"]$std_dev),
                  rnorm(n=n,mean = sw1["mean"]$mean,sd = sw1["std_dev"]$std_dev),
                  rnorm(n=n,mean = pl1["mean"]$mean,sd = pl1["std_dev"]$std_dev),
                  rnorm(n=n,mean = pw1["mean"]$mean,sd = pw1["std_dev"]$std_dev),
                  rep("setosa",times=n))
  dfsetosa <- as.data.frame(dfsetosa)
  names(dfsetosa)<-c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species")
  
  sl1 <- params[[1]][2,] #2-versicolor
  sw1 <- params[[2]][2,]
  pl1 <- params[[3]][2,]
  pw1 <- params[[4]][2,]
  dfversicolor<-cbind(rnorm(n=n,mean = sl1["mean"]$mean,sd = sl1["std_dev"]$std_dev),
                  rnorm(n=n,mean = sw1["mean"]$mean,sd = sw1["std_dev"]$std_dev),
                  rnorm(n=n,mean = pl1["mean"]$mean,sd = pl1["std_dev"]$std_dev),
                  rnorm(n=n,mean = pw1["mean"]$mean,sd = pw1["std_dev"]$std_dev),
                  rep("versicolor",times=n))
  dfversicolor<- as.data.frame(dfversicolor)
  names(dfversicolor)<-c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species")
  
  sl1 <- params[[1]][3,] #3-virginica
  sw1 <- params[[2]][3,]
  pl1 <- params[[3]][3,]
  pw1 <- params[[4]][3,]
  dfvirginica<-cbind(rnorm(n=n,mean = sl1["mean"]$mean,sd = sl1["std_dev"]$std_dev),
                  rnorm(n=n,mean = sw1["mean"]$mean,sd = sw1["std_dev"]$std_dev),
                  rnorm(n=n,mean = pl1["mean"]$mean,sd = pl1["std_dev"]$std_dev),
                  rnorm(n=n,mean = pw1["mean"]$mean,sd = pw1["std_dev"]$std_dev),
                  rep("virginica",times=n))
  dfvirginica<- as.data.frame(dfvirginica)
  names(dfvirginica)<-c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species")
  
  
  rbind(dfsetosa,dfversicolor,dfvirginica,stringsAsFactors=FALSE)
}
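`gendata()` repeats the same block three times; the sketch below is a more compact equivalent, assuming the `params` argument indexes like the printed `pcond` tables (a list of four tables with `mean`/`std_dev` columns and one row per class). It also keeps the columns numeric, avoiding the character coercion that `cbind()` causes:

```r
gen_class <- function(n, params, row, label) {
  vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
  # One rnorm() draw per variable, parameters taken from the class's row
  cols <- setNames(lapply(params, function(p)
    rnorm(n, mean = p[row, "mean"], sd = p[row, "std_dev"])), vars)
  df <- as.data.frame(cols)  # stays numeric, unlike cbind() of mixed types
  df$Species <- label
  df
}

gendata2 <- function(n, params) {
  do.call(rbind, Map(gen_class, row = 1:3,
                     label = c("setosa", "versicolor", "virginica"),
                     MoreArgs = list(n = n, params = params)))
}
```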

testd <- gendata(n=200,bayes@model$pcond)
# cbind() coerced every column to character, so the measurements
# have to be converted back to numeric:
testd$Sepal.Length <- as.double(as.character(testd$Sepal.Length))
testd$Petal.Length <- as.double(as.character(testd$Petal.Length))
testd$Sepal.Width <- as.double(as.character(testd$Sepal.Width))
testd$Petal.Width <- as.double(as.character(testd$Petal.Width))
plot_ly(testd,x=~Petal.Length,y=~Sepal.Length, z=~Petal.Width, color = ~Species,marker = list(opacity=0.5))
td<-as.h2o(testd)
h2o.performance(nn,td)
## H2OMultinomialMetrics: deeplearning
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.04139032
## RMSE: (Extract with `h2o.rmse`) 0.2034461
## Logloss: (Extract with `h2o.logloss`) 0.1566785
## Mean Per-Class Error: 0.055
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error       Rate
## setosa        200          0         0 0.0000 =  0 / 200
## versicolor      0        188        12 0.0600 = 12 / 200
## virginica       0         21       179 0.1050 = 21 / 200
## Totals        200        209       191 0.0550 = 33 / 600
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.945000
## 2 2  1.000000
## 3 3  1.000000
h2o.performance(bayes,td)
## H2OMultinomialMetrics: naivebayes
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.01198691
## RMSE: (Extract with `h2o.rmse`) 0.1094847
## Logloss: (Extract with `h2o.logloss`) 0.04045009
## Mean Per-Class Error: 0.01666667
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error       Rate
## setosa        200          0         0 0.0000 =  0 / 200
## versicolor      0        193         7 0.0350 =  7 / 200
## virginica       0          3       197 0.0150 =  3 / 200
## Totals        200        196       204 0.0167 = 10 / 600
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.983333
## 2 2  1.000000
## 3 3  1.000000
h2o.performance(rf,td)
## H2OMultinomialMetrics: drf
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.02725207
## RMSE: (Extract with `h2o.rmse`) 0.165082
## Logloss: (Extract with `h2o.logloss`) 0.1012265
## Mean Per-Class Error: 0.035
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error       Rate
## setosa        200          0         0 0.0000 =  0 / 200
## versicolor      0        189        11 0.0550 = 11 / 200
## virginica       0         10       190 0.0500 = 10 / 200
## Totals        200        199       201 0.0350 = 21 / 600
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.965000
## 2 2  1.000000
## 3 3  1.000000
testd<-cbind(testd,as.data.frame(h2o.predict(nn,td)))
testd$val <- ifelse(testd$Species == testd$predict,"+","-")
plot_ly(testd,x=~Petal.Length,y=~Sepal.Length, z=~Petal.Width, color = ~val,colors=c("red","green"),marker = list(opacity=0.2),text=~paste(Species," ",predict))
nn <- h2o.deeplearning(x=2:4,y=5,irisdf,hidden = c(10),epochs = 1000,diagnostics=TRUE,variable_importances = TRUE,export_weights_and_biases=TRUE) # one hidden layer of 10 units, retrained on the full data set without Sepal.Length (column 1)
nn@model$training_metrics
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("iris")`
## MSE: (Extract with `h2o.mse`) 0.01423088
## RMSE: (Extract with `h2o.rmse`) 0.1192933
## Logloss: (Extract with `h2o.logloss`) 0.04304078
## Mean Per-Class Error: 0.02
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         50          0         0 0.0000 =  0 / 50
## versicolor      0         48         2 0.0400 =  2 / 50
## virginica       0          1        49 0.0200 =  1 / 50
## Totals         50         49        51 0.0200 = 3 / 150
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.980000
## 2 2  1.000000
## 3 3  1.000000
#nn@model$model_summary
h2o.varimp_plot(nn)

rf <- h2o.randomForest(x=1:4,y=5,itrain, ntrees = 40) # RF of 40 trees
rf@model$training_metrics
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("train")`
## MSE: (Extract with `h2o.mse`) 0.04175959
## RMSE: (Extract with `h2o.rmse`) 0.2043516
## Logloss: (Extract with `h2o.logloss`) 0.1569769
## Mean Per-Class Error: 0.04939024
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         45          0         0 0.0000 =  0 / 45
## versicolor      0         37         3 0.0750 =  3 / 40
## virginica       0          3        38 0.0732 =  3 / 41
## Totals         45         40        41 0.0476 = 6 / 126
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.952381
## 2 2  1.000000
## 3 3  1.000000
rf@model$model_summary
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              40                      120               15503         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         7    3.24167          2         11     5.31667
h2o.varimp_plot(rf)

Finally, the decision regions of the random forest can be visualized by predicting on a dense 3D grid of three of the variables:

x <- seq(0.1,5,by=0.1)
y <- seq(0.1,7,by=0.1)
z <- seq(0.1,3,by=0.1)
gr<-expand.grid(Sepal.Width=x,Petal.Length=y,Petal.Width=z)
grid<-as.h2o(gr,destination_frame = "grid")
prrf<-as.data.frame(h2o.predict(rf,grid))
plot_ly(gr,x=~Petal.Length,y=~Sepal.Width, z=~Petal.Width, color = prrf$predict,marker = list(opacity=0.2))
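The grid-prediction trick works with any classifier: predict on a dense grid, then color the grid points by the predicted class. A self-contained illustration with a simple nearest-centroid rule on two variables (illustrative only, not the random forest above):

```r
# Per-class centroids of the two petal measurements
cent <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = mean)

grid2 <- expand.grid(Petal.Length = seq(1, 7, by = 0.1),
                     Petal.Width  = seq(0.1, 2.5, by = 0.1))

# Label each grid point with the nearest centroid: these are the decision regions
grid2$pred <- apply(grid2, 1, function(pt)
  as.character(cent$Species)[which.min(colSums((t(cent[, 2:3]) - pt)^2))])
table(grid2$pred)
```

The labeled grid could then be drawn with plot_ly the same way as the 3D grid above.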